In this project, we will look into Car Collision Report data, and analyze in detail what this data represents. We will make graphs, to visually analyze and find patterns, make predictions, and come to a better understanding on how we can reduce the amount of car collisions.
The data is obtained from dataMontgomery(https://data.montgomerycountymd.gov/Public-Safety/Crash-Reporting-Drivers-Data/mmzv-x632/data), a website that keeps records from the County of Montgomery in Maryland. The data has more than 67,000 enteries, each being a driver. The data has been collected from January 1, 2015 and is updated weekly. It has 32 columns, mostly plain-text data about the circumstance involving the accidents, such as location, vehicle type, vehicle movement, type of road and date of accident. One thing to take into account is that some of the information is self-reported and has not been varafied by a neutral party.
> #File locaiton on Dario's Machine
> dat = read.csv("/Users/dariomolina/DataScience/Crash-Report-Project/Crash_Reporting_-_Drivers_Data.csv",stringsAsFactors=FALSE)
> #Address on Evert's Machine
>
> #dat = read.csv("/Users/evert/Dropbox/School/Assignments/Crash_Reporting_-_Drivers_Data.csv",stringsAsFactors=FALSE)
Changing empty blanks to 0 in Off.Road.Description, Related.Non.Motorist, Non.Motorist.Substance.Abuse columns
> x = c("Off.Road.Description", "Related.Non.Motorist", "Non.Motorist.Substance.Abuse")
>
> #assigning value 0 to "" in x vector
> for( i in x)
+ {
+ dat[[i]][dat[[i]] == ""] = 0
+ }
We noticed that there were some columns with almost no data that were either empty to had an “N/A” string. We decided to remove the empty blanks and “N/A” in data set and replace with to N/A value
> #assignig rest values to NA
> for(i in names(dat))
+ {
+ dat[[i]][dat[[i]] == "" | dat[[i]] == "N/A"] = NA
+ }
>
>
>
> na_count <-sapply(dat, function(y) sum(length(which(is.na(y)))))
> na_count <- data.frame(sort(na_count,decreasing = TRUE))
> print(head(na_count,7))
sort.na_count..decreasing...TRUE.
Municipality 59207
Circumstance 53006
Driver.Substance.Abuse 10853
Traffic.Control 10019
Surface.Condition 7644
Cross.Street.Type 6187
Route.Type 6184
We noted that nearly all of the enteries (more than 50K) did not have a value for Non-Moterist Substance Abuse, Related Non-Motorist, Off-Road Description, Municipality, Circumstance so we decided to focus on questions where these enteries were not as relevant.
Creating a new dataframe containing incidents in which a bicyclist was involved and the driver was at fault and NOT distracted. (194 data points) We also created a seperate dataframe without this subset.
> dat_bi_daf = dat[dat$Related.Non.Motorist == "BICYCLIST" & dat$Driver.At.Fault == "Yes",]
>
> dat.sub1 <- subset(dat_bi_daf, Driver.Distracted.By == "LOOKED BUT DID NOT SEE")
> dat.sub2 <- subset(dat_bi_daf, Driver.Distracted.By == "NOT DISTRACTED")
>
> dat.sub <- rbind(dat.sub1,dat.sub2)
> dat.subwb = dat[!(dat$Report.Number %in% dat.sub$Report.Number),]
Before we started analyzing the data in detail, and since we are dealing with crash report type, we wanted to figure out which crash types were the most common.
> par(mar=c(3,14,2,2))
> x = table(dat$Collision.Type)
> barplot(x[x > 5000], horiz=TRUE, las=1, main=" ", col="firebrick")
From the graph, we realized that the most common type of car crash is same direction rear end. Which we assumed it would be either same direction rear end, or face to face collision like when two cars in opposite direction crash.
We then thought, since we know what are the most common types of car crash, what time during the year do they happen the most? In what month are car crashes most common? Do they rely on seasons, by festivities?
> #amount of crashes per month
> x =table(substr(dat$Crash.Date.Time,1,2))
> names(x) = c("Jan.","Feb.","Mar.","Apr.","May","Jun.","Jul.","Aug.","Sep.","Oct.","Nov.","Dec.")
> barplot(x, las = 1, col="firebrick")
From the graph we can see January is when most crashes happen. We had previously anticipated that perhaps most crashes would happen in the summer, since students are out from school, people go out on vacation, perhaps June and July would be the highest. But to our surprise January had the most. New Year resolution, new year, new car crash.
We were interested in figuring out, from the accidents reported, what were the most common types of report types. Because of the crashes, did many more fatalities happened than property damage? Which type of crash happend had the most different types of report types?
> y = table(dat$ACRS.Report.Type, dat$Collision.Type)
> x= c("STRAIGHT MOVEMENT ANGLE","SINGLE VEHICLE","SAME DIR REAR END","OTHER","HEAD ON LEFT TURN")
> y = y[,x]
>
> par(mar=c(3,2,3,2))
> plot(y, las=1,col="firebrick", main=" ")
After exploring the data further into detail, we were interested in finding out for the most common types of crashes, what were the most common types of reports produced. My partner and I thought fatality was going to have a bigger impact, but to our surprise it had the least impact.
Since we had analyzed the different most common types of car crash, and what time in the year they happen the most, we were curious to see under what kind of weather, what types of car crashes happend the most.
> z = table(dat$Weather,dat$Collision.Type)
> y= z[,c("STRAIGHT MOVEMENT ANGLE","SINGLE VEHICLE","SAME DIR REAR END","SAME DIRECTION RIGHT TURN","SAME DIRECTION LEFT TURN","HEAD ON LEFT TURN","OTHER")]
> y = y[c("CLEAR","CLOUDY","RAINING"),]
>
> par(mar=c(4,.5,4,.3))
> plot(y, main=" ", col="firebrick", las=1)
From the graph derived from the data, we were able to identify that same direction rear end, had the most types of crashes under clear weather. And clear weather overall was the most common type of weather were car crashes happened. Personally, my team mate and I thought rainy weather would be the most common type of weather where most crashes would happen since the rain can make tires have less grip on the road and be more prone to accidents.
We explored the data in greater detail for we wanted to figure if the surface condition when a car crash, correlated with the type of weather when a crash was reported. Since clear weather was most common, we would expect to have dry surface to be most common. We should expect a direct correlation between weather and surface condtion.
> #crashes by surface condition
> x = table(dat$Surface.Condition,dat$Collision.Type)
>
> cols = c("HEAD ON LEFT TURN","OTHER","SAME DIR REAR END","SAME DIRECTION SIDESWIPE","SINGLE VEHICLE","STRAIGHT MOVEMENT ANGLE")
> rows = c("DRY","WET","SNOW")
> z = x[rows,cols]
> plot(z, main=" ", las=1, col="firebrick")
Based on the graph, our expectations were met. And we also noted how same direction rear end still was the most common type of crash under the different types of surface condtion. We also wanted to verify that wet surface condtion had the same correlation to rainy weather.
For this section, we were intrested in accidents involving bicycles. We decided to only look at the incidents in which the driver was at fault and not distracted because these situations are the ones where we can hopefully correct. This gave us almost 200 data points to analyze. We did some inital plots to try and find some pattern to these accidents.This first graph shows the total number of accidents that occured for each month.
> par(mar=c(2,9,2,2))
> x =table(substr(dat.sub$Crash.Date.Time,1,2))
> names(x) = c("Jan.","Feb.","Mar.","Apr.","May","Jun.","Jul.","Aug.","Sep.","Oct.","Nov.","Dec.")
> barplot(x, las = 1, col="blue4", main="Num. of Accidents Per Month")
Not too surprisingly, the amount of crashes increases throughout the year as the weather is more suitable for riding bicyles, and dips down during the winter season.
After the frist graph, we wondered if the number crashes would also increase during times of heavy traffic during a day, such as when people are commuting to or from work/school.
> #formating the Crash.Date.Time Column
> dat.sub$Crash.Date.Time <- as.POSIXlt(as.character(dat.sub$Crash.Date.Time), format = "%m/%d/%Y %I:%M:%S %p")
> par(mar=c(4,2.5,2,2))
> y = table(substr(dat.sub$Crash.Date.Time,12,13))
> names(y) = c("12am","1am","5am","6am","7am","8am","9am","10am","11am","12pm","1pm","2pm","3pm","4pm","5pm","6pm","7pm","8pm","9pm","10pm","11pm")
> barplot(y,las=1,col = "blue4",main="Total Numb. of Accidents Per Each Hour", xlab = "Hour")
The graph confirmed our inital suspisions by showing that the number of crashes spiked during common comute times.
Next, we decided to see if we could gain any insights in regards to where these accidents happen so we plotted them onto a map of the county. Once again, we compared the map of bicycle related accidents to the rest of the data.
> library(ggmap)
>
> par(mfrow=c(2,2))
> map = get_map(c(-77.08025,39.07991),maptype = "road",zoom=11)
> p = ggmap(map)
> p = p + geom_point(data=dat.sub, aes(x=Longitude,y=Latitude),color="blue4",size=3)
> print(p)
> map = get_map(c(-77.08025,39.07991),maptype = "road",zoom=13)
> j = ggmap(map)
> j = j + geom_point(data=dat.subwb, aes(x=Longitude,y=Latitude),color="blue4",size=3)
> print(j)
Although not too visable from the small graph, it seemed to indicate that a significant amount of the incidents occured at or near an intersection.
Based on the results of the previous map, we decided to graph the movement of the vehicle. To better get an insight into this, it was compared to the movement of the vehicle in the overall dataset.
> par(mfrow=c(2,1))
> par(mar=c(2,13,2,2))
> barplot(head(sort(table(dat.sub$Vehicle.Movement),decreasing = TRUE)),col="blue4",las=1,horiz = TRUE,main="Vehicle Movement w/ Bicycle Involvment")
>
> par(mar=c(2,13,2,2))
> barplot(head(sort(table(dat.subwb$Vehicle.Movement),decreasing = TRUE)),col="blue4",las=1,horiz = TRUE,main="Vehicle Movement")
From the plot, we learned that at least about 1/3 of the incidents involved the vehicle making some sort of turn when a bicylce is involved. This is in contrast to the rest of dataset in which only one of the top six vehicle movements involves the vehicle during a turn.
Finally, we decided to generalize the vehicle movement category into four categories in order to simplify maping the points. The turning sub-category indicates incidents in which the vehicle was making any type of turn (left,right,U,etc.), the accelerating sub-category shows incidents where the vehicle was either changing speed (slowing dow/speeding up) or changing lanes (including merging into traffic).
> turning=c("MAKING LEFT TURN", "MAKING RIGHT TURN","MAKING U TURN","RIGHT TURN ON RED")
>
> acc = c("ACCELERATING","ENTERING TRAFFIC LANE","STARTING FROM PARKED","STARTING FROM PARKED","STARTING FROM LANE", "SLOWING OR STOPPING","CHANGING LANES")
>
> stat = c("PARKED","STOPPED IN TRAFFIC LANE")
>
> other = c("BACKING","SKIDDING", "PASSING" , "MOVING CONSTANT SPEED")
>
> dat.sub$Vehicle.Movement[dat.sub$Vehicle.Movement %in% turning] <- "TURNING"
> dat.sub$Vehicle.Movement[dat.sub$Vehicle.Movement %in% acc] <- "ACCELERATING"
> dat.sub$Vehicle.Movement[dat.sub$Vehicle.Movement %in% other] <- "OTHER"
> dat.sub$Vehicle.Movement[dat.sub$Vehicle.Movement %in% stat] <- "STATIONARY"
>
>
> map = get_map(c(-77.08025,39.07991),maptype = "road",zoom=11)
> p = ggmap(map)
> p = p + geom_point(data=dat.sub, aes(x=Longitude,y=Latitude, color = Vehicle.Movement),size=3)
>
> print(p)
From this last map, we observed that accidents where the vehicle was turning seem to be a bit more prone to clustering up, at least more than the other categories. This might indicate that there are certain locations or intersections in the county where it might be dificult to see a bicyclist, or a road where there is no designated bicycle lane.